ABSTRACT

This paper presents three different methods to develop multilingual phone
models for flexible speech recognition tasks. The main goal of our
investigations is to find multilingual speech units which work equally well
in many languages. With such a universal set it is possible to build speech
recognition systems for a variety of languages. One advantage of this
approach is the sharing of acoustic-phonetic parameters in an HMM-based
speech recognition system. The multilingual approach starts with the phone
sets of six languages, which yield 232 language-dependent,
context-independent phone models. We then developed three different methods
to map the language-dependent models to a multilingual phone set. The first
method is a direct mapping to the phone set of the International Phonetic
Association (IPA). In the second approach we apply an automatic clustering
algorithm to the phone models. The third method exploits the similarities of
single mixture components of the language-dependent models. As in the first
method, the language-specific models are first mapped to the IPA inventory.
In a second step an agglomerative clustering is performed on the density
level to find regions of similarity between the phone models of different
languages. The experiments carried out with the SpeechDat(M) database show
that the third method yields almost the same recognition rate as
language-dependent models. At the same time, this method gives a large
reduction of the number of densities in the multilingual system.

1.  INTRODUCTION 

Over the last years automatic speech recognition systems have reached a
level of quality which allows the introduction of commercial products.
However, a new problem has emerged: the language dependency of current
recognition technology. The phonetic models used in state-of-the-art systems
are extremely language-dependent. The overall goal of our research
activities is to create a multilingual and almost language-independent
recognition system which works in the most important languages of the world.
We started our multilingual approach with the OGI MLTS database [15], based
on the work of [1]. Nowadays, even larger multilingual databases are
available, such as SpeechDat(M)¹ and Call-Home. These databases allow a
robust modeling of phonetic units for different languages. Instead of using
language-dependent acoustic models, our approach tries to exploit the
acoustic-phonetic similarities of sounds across languages. This approach has
two main advantages. First, the number of HMM parameters can be reduced
significantly if it is possible to share phone models across languages.
Second, these multilingual models speed up the process of cross-language
transfer: with multilingual phone models the huge data collection process
can be avoided or at least reduced. This paper shows different approaches to
exploit the acoustic-phonetic similarities.

¹ For information about SpeechDat see the following URLs:
http://www.phonetik.uni-muenchen.de/SpeechDat.html
http://www.icp.grenet.fr/HLRA/home.html

The paper is organized as follows. First, we present three different methods
to create multilingual phone models using HMM technology. Then we describe
our experiments with a language-dependent system covering six languages,
followed by the experiments with the multilingual approaches. Finally, we
give a summary of the current research status and a perspective for future
research activities.

2.  MULTILINGUAL  PHONE  MODELING 

This section shows different approaches to find multilingual phone models
for automatic speech recognition tasks. One central problem is to detect and
to exploit the acoustic-phonetic similarities across languages. Which sound
in one language is similar enough to a sound of another language to be
represented by one common model? This question leads to the definition of a
similarity measure for speech sounds. The other question is whether the
phone is the optimal entity to exploit the similarities, or whether another
speech unit, such as a sub-phone unit or a single density of a continuous
density HMM (CDHMM), is more appropriate to create multilingual models. The
overall goal of the different approaches is to generate multilingual speech
units which perform as well as language-dependent models for different
recognition tasks. Thus, the task is to create accurate acoustic models
which also exploit the similarities across languages.

2.1.  Mapping  to  the  IPA  based  phone  set  (IPA-MAP) 

The most obvious approach is to map the language-dependent models to the
appropriate phone of the inventory of the International Phonetic Association
(IPA). Here, the phonetic mapping is performed with phonetic knowledge
rather than with a statistical similarity measure. Most of the phonetic
inventories in use are based on IPA, like SAMPA, WORLDBET, TIMITBET or
SPICOS. The rule for mapping the language-dependent phones $Ph^{l}_{LDP}$ of
language $l$ to the multilingual phone units is:

$$Ph^{l}_{LDP} \rightarrow Ph_{IPA} \qquad (1)$$

The mapping is performed for each language. All phonetic segmentation and
transcription files (label files) are transformed to the IPA-based
inventory. After this mapping a Viterbi-based HMM maximum likelihood
training is performed. Figure 1 shows the different steps of the approach
IPA-MAP.


loop over all languages
    loop over all phones of one language
        map the language-dependent phone to the IPA phone:
            $Ph^{l}_{LDP} \rightarrow Ph_{IPA}$
        add to mapping file
    transformation of the label files using the mapping file
HMM training over all languages:
    - HMM-init
    - HMM-Viterbi training, 6 iterations

Figure 1: Algorithm IPA-MAP


The main advantage of this approach is the simple way of obtaining
multilingual models. Further, the final IPA-based models have a clear
representation in the multilingual context and the cross-language transfer
is also very simple: the sounds of a new language can be extracted very
easily from the multilingual phone library. On the other hand, the direct
use of IPA does not consider the spectral properties and the statistical
similarities of the phone models. Further, the IPA-based units do not model
some language-dependent properties of the sounds. This can decrease the
accuracy of the acoustic models, and the problem becomes more severe as more
languages are included. Another disadvantage is that inconsistencies between
the phone systems and inventories of different languages can hurt this
method.
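The mapping step of IPA-MAP can be illustrated with a minimal sketch. The
mapping tables below are hypothetical excerpts chosen for illustration, and
the function names `map_labels` and `shared_units` are assumptions of this
sketch, not part of the original system.

```python
from collections import defaultdict

# Hypothetical excerpt of per-language mapping tables (language-dependent
# symbol -> IPA-based multilingual unit). The entries are illustrative only.
MAPPING = {
    "GE": {"m": "m", "n": "n", "S": "ʃ"},
    "SP": {"m": "m", "n": "n", "T": "θ"},
    "AE": {"m": "m", "n": "n", "sh": "ʃ"},
}


def map_labels(language, labels):
    """Rewrite one label sequence with the IPA-based unit names."""
    table = MAPPING[language]
    return [table[phone] for phone in labels]


def shared_units(mapping):
    """Return, for each multilingual unit, the set of languages using it."""
    usage = defaultdict(set)
    for language, table in mapping.items():
        for unit in table.values():
            usage[unit].add(language)
    return dict(usage)


mapped = map_labels("GE", ["S", "m", "n"])
usage = shared_units(MAPPING)
```

Counting the languages per unit in this way gives the kind of overview of
shared and monolingual phones that is used later to describe the resulting
inventory.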

2.2.  Multilingual  Phone  Clustering  (MUL-CLUS) 

In this approach the language-dependent phone models are mapped to a
multilingual set using a bottom-up clustering algorithm. This requires a
similarity measure between two phone models. In this work we apply a
distance measure based on the log-likelihood LL. The distance between two
phone models $\lambda_i$ and $\lambda_j$ is:

$$D_{LL}(\lambda_i, \lambda_j) = LL_i^i - LL_i^j \qquad (2)$$

$$D_{LL}(\lambda_i, \lambda_j) = \log p(X_i \mid \lambda_i) - \log p(X_i \mid \lambda_j) \qquad (3)$$

where $\lambda_i$ is the model of phone $i$ and the data is given by the
tokens $X_i$. Correspondingly, the distance $D_{LL}(\lambda_j, \lambda_i)$
is given by:

$$D_{LL}(\lambda_j, \lambda_i) = LL_j^j - LL_j^i \qquad (4)$$

$$D_{LL}(\lambda_j, \lambda_i) = \log p(X_j \mid \lambda_j) - \log p(X_j \mid \lambda_i) \qquad (5)$$

Because these distances are not symmetric, we calculate the average
distance, which is given in equation (6).
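The directed and the averaged distance can be sketched as follows. This is
only an illustration under simplifying assumptions: each phone model is
replaced by a one-dimensional Gaussian with a hypothetical (mean, variance)
pair instead of an HMM, and the log-likelihood is averaged per token.

```python
import math


def log_likelihood(tokens, mean, var):
    """Average log-likelihood of scalar tokens under a 1-D Gaussian.

    The Gaussian is a stand-in for a phone HMM; it is an assumption of this
    sketch, not the model used in the paper.
    """
    total = 0.0
    for x in tokens:
        total += -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)
    return total / len(tokens)


def d_ll(tokens_i, model_i, model_j):
    """Directed distance: tokens of phone i scored by model i and by model j."""
    return log_likelihood(tokens_i, *model_i) - log_likelihood(tokens_i, *model_j)


def symmetric_distance(tokens_i, model_i, tokens_j, model_j):
    """Average of the two directed distances."""
    return 0.5 * (d_ll(tokens_i, model_i, model_j) + d_ll(tokens_j, model_j, model_i))


# Toy data: two phones with hypothetical (mean, variance) models.
tokens_a, model_a = [0.9, 1.0, 1.1], (1.0, 1.0)
tokens_b, model_b = [2.9, 3.0, 3.1], (3.0, 1.0)
d_ab = symmetric_distance(tokens_a, model_a, tokens_b, model_b)
d_ba = symmetric_distance(tokens_b, model_b, tokens_a, model_a)
```

Averaging the two directions makes the measure independent of the argument
order, which is what a symmetric distance matrix for clustering requires.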


At each clustering step the most similar pair of clusters is merged into a
new cluster. That is, of all cluster pairs the two clusters $C_i$ and $C_j$
with the smallest distance are merged:

$$(C_i, C_j) = \arg\min_{C_i, C_j} D(i, j) \qquad (7)$$

Because the new phone model of a merged cluster is difficult to estimate,
the distance is always computed with the original language-dependent models,
which are the basic elements of each cluster. Hence, the distance between
two clusters is determined with the furthest neighbor criterion: we take the
maximum distance between the initial clusters $C_k$ and $C_l$, which in this
case are the language-dependent phone units.

$$D(i, j) = \max_{k \in C_i,\, l \in C_j} D(k, l) \qquad (8)$$

The furthest neighbor criterion also has the advantage of avoiding expensive
log-likelihood calculations. The calculation of equation 6 also requires the
data of the phone models. The data corresponds to the phone tokens which are
extracted from the phonetic label files. Each phone has a pool of tokens
which are used for the distance calculation. The number of tokens of each
language-dependent phone unit is set to 500.

The complete algorithm to create multilingual phone models using clustering
methods is given in figure 2.


loop over all languages
    HMM Viterbi training
    create language-dependent phone models
init: define a set of initial clusters from language-dependent phones,
      $C_i := \{Ph_i\}$
compute a symmetric distance matrix
while ($D_{min} < D_{thres}$)
    find the pair of clusters with the minimum distance $D_{min}$
    merge the two clusters: $C = C_i \cup C_j$
    update the distance matrix
mapping of the language-dependent phones to the multilingual clusters
HMM training over all languages:
    - HMM-init
    - HMM-Viterbi training, 6 iterations

Figure 2: Algorithm to create multilingual phone models using phone distance
measurement and clustering (MUL-CLUS)


$$D(\lambda_i, \lambda_j) = \frac{1}{2}\left( D_{LL}(\lambda_i, \lambda_j) + D_{LL}(\lambda_j, \lambda_i) \right) \qquad (6)$$
The clustering process continues until all calculated cluster distances are
higher than a pre-defined distance threshold. Alternatively, the clustering
stops when a specified number of final clusters is reached. After the
clustering is finished, the cluster information is used to map the
language-dependent models to the multilingual inventory. All label files are
processed with this mapping information. Then the HMMs are trained with the
maximum likelihood based Viterbi training.

The automatic clustering has the advantage of using a statistical measure
based on HMM technology, which is also used during recognition. The
disadvantage is that the final multilingual units lose their clear phonetic
representation, and it is more difficult to transfer these models to a new
language.
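The clustering loop with the furthest neighbor criterion and a distance
threshold can be sketched as follows. The phone names and the distance
values are invented for illustration, and the function names are
assumptions of this sketch.

```python
def complete_linkage(cluster_a, cluster_b, dist):
    """Furthest-neighbor distance: maximum over all member pairs."""
    return max(dist[k][l] for k in cluster_a for l in cluster_b)


def cluster_phones(names, dist, threshold):
    """Merge the closest pair of clusters until the smallest distance
    reaches the threshold. `dist` is a symmetric matrix over `names`."""
    clusters = [[i] for i in range(len(names))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = complete_linkage(clusters[a], clusters[b], dist)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d_min, a, b = best
        if d_min >= threshold:
            break
        # Distances are always looked up between the original models,
        # so merging only needs to join the two index lists.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [[names[i] for i in cluster] for cluster in clusters]


# Hypothetical language-dependent phones and symmetric distances.
names = ["m_GE", "m_SP", "n_GE", "a_SP"]
dist = [
    [0.0, 0.2, 1.0, 5.0],
    [0.2, 0.0, 1.2, 5.5],
    [1.0, 1.2, 0.0, 4.0],
    [5.0, 5.5, 4.0, 0.0],
]
result = cluster_phones(names, dist, threshold=2.0)
```

Replacing the threshold test by a test on `len(clusters)` gives the
alternative stopping rule with a fixed number of final clusters.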

2.3.  IPA-based  Density  Clustering  (IPA-OVL) 

The previous two approaches try to create complete multilingual phone
models. This means that all parameters (i.e. sub-phone units, densities of a
CDHMM) of one model are shared across the different languages. On the other
hand, there are several language-specific properties of the sounds. They
exist due to different phonetic contexts, speaking style and rate, prosodic
features and allophonic variations. To cover these effects we have presented
a novel approach to create multilingual phone models [15]. Instead of
completely overlapping phone models we assume that there are
language-independent realizations. This is modeled by using mixture
densities. Figure 3 shows the idea of this method. There are regions of one
IPA sound which are used in one, two or three languages. In this example the
nasal [m] occurring in the languages German, Spanish and English has mixture
components which are used in one, two or all three languages.

Figure 3: Principles of the method IPA-OVL (two-dimensional case).

The creation of the multilingual models is shown in figure 4. First, the
language-dependent models are trained as before. Each language-dependent
phone consists of 3 segments (sub-phone units), each modeled by a mixture
density. This is expressed by:

$$\lambda^{mono}_{l,p} = \{ b^{mono}_{l,p,1},\, b^{mono}_{l,p,2},\, b^{mono}_{l,p,3} \} \qquad (9)$$

where $l$ is the language index and $p$ is the phone index.

In the second step the mixtures of the language-dependent segments which
belong to the same IPA-based phone are collected in one common pool of
densities. Then we apply a hierarchical agglomerative clustering algorithm
to find and merge similar densities. The clustering is performed for each
segment separately.

Because our system works with global variance values, we use only the mean
vectors for clustering. As distance measure giving the similarity between
$\mu_i$ and $\mu_j$ the weighted L1-norm is applied:

$$D(\mu_i, \mu_j) = w(N_i, N_j) \sum_{d} |\mu_{i,d} - \mu_{j,d}| \qquad (10)$$

where the weight $w(N_i, N_j)$ depends on the occurrence counts of the two
densities and $d$ runs over the vector components. In previous
investigations we found that it is important to normalize the distance by
the numbers of occurrences $N_i$ and $N_j$, which indicate how often the
densities are seen during training. This normalization avoids the generation
of very big clusters which dominate the small clusters. One important aspect
is that all clusters should have a similar number of elements. Otherwise the
resulting clusters lose their power to discriminate between different
sounds.

loop over all IPA-based phones
    loop over all three segments of one phone
        create a pool of densities belonging to the same IPA-based segment
        calculate the distance matrix for each pool of densities
while (number of densities > minimum number of densities)
    loop over all IPA-based phones
        loop over all phone segments (1,2,3)
            find pair of clusters $C_i$, $C_j$ with the minimum distance
    merge the two clusters: $C = C_i \cup C_j$
    and remove clusters $C_i$ and $C_j$
    update distance matrix

Figure 4: Algorithm to create multilingual mixture densities (IPA-OVL)

For each pool of densities a distance matrix is calculated using equation
10. After each clustering step the overall number of densities is reduced by
one element. The new cluster is given by the averaged mean vector of the two
merged clusters. The clustering is finished when the complete system has a
pre-defined number of densities. After finishing the cluster algorithm we
have a multilingual mixture density for each IPA-based phone. Whereas the
mixture density has multilingual regions, the mixture weights are still
language-dependent. The emission probabilities are therefore calculated from
the shared densities with language-dependent mixture weights.
Hence, this approach has some similarities to semi-continuous HMMs. However,
here the densities are shared only for one segment of one IPA-based phone
across different languages. As a final step the parameters of the
multilingual mixture densities are reestimated in a Viterbi training. With
this kind of multilingual modeling we also achieve a large reduction of
parameters in the multilingual system. By combining language-specific
properties with the automatic detection of multilingual realizations, we
exploit the acoustic-phonetic similarities in an optimal way.
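The density-level clustering for one pool can be sketched as follows. Two
points are assumptions of this sketch and not taken from the text: the
concrete weight `n_i * n_j / (n_i + n_j)` used for the weighted L1 distance,
and the rule that the occurrence counts of merged densities are summed. The
sketch also stops at a target size per pool, whereas the method described
above stops when the complete system reaches a pre-defined number of
densities.

```python
def weighted_l1(mean_i, n_i, mean_j, n_j):
    """Weighted L1 distance between two mean vectors.

    The weight n_i * n_j / (n_i + n_j) is an assumption of this sketch; it
    makes merges of frequently seen densities more expensive.
    """
    weight = n_i * n_j / (n_i + n_j)
    return weight * sum(abs(a - b) for a, b in zip(mean_i, mean_j))


def cluster_pool(pool, target_size):
    """Greedily merge the closest pair of densities in one pool.

    `pool` is a list of (mean_vector, count, languages) tuples that all
    belong to the same segment of the same IPA-based phone.
    """
    pool = list(pool)
    while len(pool) > target_size:
        best = None
        for i in range(len(pool)):
            for j in range(i + 1, len(pool)):
                d = weighted_l1(pool[i][0], pool[i][1], pool[j][0], pool[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        mean_i, n_i, langs_i = pool[i]
        mean_j, n_j, langs_j = pool[j]
        # The new cluster is the averaged mean vector of the merged pair.
        merged_mean = [(a + b) / 2.0 for a, b in zip(mean_i, mean_j)]
        pool[i] = (merged_mean, n_i + n_j, langs_i | langs_j)
        del pool[j]
    return pool


# Hypothetical pool: one density per language for the same phone segment.
pool = [
    ([0.0, 0.0], 10, {"GE"}),
    ([0.1, 0.0], 10, {"SP"}),
    ([3.0, 3.0], 10, {"AE"}),
]
clusters = cluster_pool(pool, target_size=2)
```

In this toy pool the German and Spanish densities end up in one multilingual
cluster, while the English density stays monolingual.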

3.  EXPERIMENTS 

In this section we perform several tests to compare the multilingual
approaches. First, we briefly describe the speech engine. Second, we present
the multilingual system using the language-dependent models. This system
serves as the baseline for the three previously described methods.

3.1.  Description  of  the  HMM-Based  ASR  system 

For our investigations we use the SIEMENS HMM-based speech engine. Every
10 ms the feature extraction generates a frame consisting of 24 mel-scaled
cepstral, 12 Δ cepstral, 12 ΔΔ cepstral, 1 energy, 1 Δ energy and
1 ΔΔ energy components. Each frame is processed by an LDA transformation
reducing the 51 components to 24 values. To work in a multilingual
environment one single LDA is calculated for all languages. The acoustic
models are based on Continuous Density HMMs (CDHMM) with Gaussian density
functions. In our investigations we work only with context-independent
models which consist of 3 sub-phone units (phone segments). Each segment is
modeled by two states with tied emission probability.
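The frame layout and the dimension reduction can be checked with a short
sketch. The projection matrix below is a hypothetical stand-in that simply
keeps the first 24 components; a real LDA matrix would be estimated from
labeled training data.

```python
# Component counts of one feature frame as listed in the text.
FRAME_LAYOUT = {
    "cepstral": 24,
    "delta_cepstral": 12,
    "delta_delta_cepstral": 12,
    "energy": 1,
    "delta_energy": 1,
    "delta_delta_energy": 1,
}


def project(frame, matrix):
    """Apply a linear transformation (rows of `matrix`) to one frame."""
    return [sum(w * x for w, x in zip(row, frame)) for row in matrix]


input_dim = sum(FRAME_LAYOUT.values())
output_dim = 24
# Hypothetical stand-in for the LDA matrix: keep the first 24 components.
matrix = [
    [1.0 if c == r else 0.0 for c in range(input_dim)]
    for r in range(output_dim)
]
frame = [float(i) for i in range(input_dim)]
reduced = project(frame, matrix)
```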

3.2.  Multilingual  System  with  language-dependent  models 

The multilingual system covers the six languages American English, French,
German, Italian, Portuguese and Spanish. The speech material is taken from
the SpeechDat(M) and the Macrophone databases. Because all databases have
only an orthographic transcription, all systems must be bootstrapped to
generate an initial segmentation and label files. The bootstrapping was
carried out with multilingual phone models based on the IPA-MAP method. The
evaluation and tests were carried out on word and phone level. The word
recognition rates are important for a final application, and the phone
recognition rates give more detailed information about the accuracy of the
acoustic modeling.

The training of the models is performed with the phonetically rich sentences
of the databases. This should guarantee the vocabulary independence of the
acoustic models. These models are also called Type-In models. The amount and
structure of the training and test material is given in table 1. The
training is performed with more than 4000 speakers and more than 35K
sentences. The duration of the training material is almost 32 hours of pure
speech without silence. The overall number of language-dependent phone units
is 232. Italian has the greatest number of phones (49) because the SAMPA
inventory distinguishes between short and long consonants. Spanish has the
smallest number, using only 31 phones. The complete system has 31999
densities, which means that on average each of the 232 language-dependent
phone models has about 138 densities, or about 46 for each of its three
segments.

Language      #speaker         #utt.    hour.min   #phones
              tr-dev-te        Tr-Utt   Tr-Time
French        667-166-167      6.0K     5.03       37
German        667-166-167      5.0K     4.18       38
Italian       667-166-167      5.8K     4.15       49
Portuguese    667-166-167      5.9K     7.33       38
Spanish       667-166-167      6.0K     5.38       31
Am.-English   1000-500-500     6.4K     5.12       39
Overall       4335-1330-1335   35.1K    31.59      232

Table 1: Structure of the training and test databases using SpeechDat(M)
and Macrophone: tr = number of speakers for training; dev = number of
speakers for development; te = number of speakers for testing; Tr-Utt =
number of phonetically rich training sentences; Tr-Time = duration of the
phonetically rich training sentences (hours.minutes); #phones = number of
phone units of each language

After the training the models are tested on an isolated word and a phone
recognition task. The recognition results for isolated words are summarized
in table 2. The vocabulary size of this task varies between 47 and 70 words
for the languages taken from SpeechDat(M). For American English the
vocabulary size is 685 because there is no core test set for application
words. The best result is achieved for German (96.6%). For the other four
European languages we also get results better than 90%. The result for
American English is only 64.9% due to the high perplexity of the recognition
task.

Language      #Rec.-Tokens   Voc. Size   Rec.-Rate
French        1420           57          92.2%
German        949            49          96.6%
Italian       983            47          94.4%
Portuguese    931            61          93.0%
Spanish       1242           70          93.3%
Am.-English   2612           685         64.9%
Average       -              -           89.0%

Table 2: Isolated word recognition rate for the SpeechDat(M) and Macrophone
databases; #Rec.-Tokens: number of tested words; Voc. Size: size of the
vocabulary (perplexity); Rec.-Rate: word recognition rate

In the second test phone recognition rates are measured. The results given
in table 3 include insertions, deletions and substitutions. For the
continuous phone recognition task language-dependent bigram models are used
to achieve a higher phone accuracy. The best phone recognition rates are
achieved for Spanish and Italian (56.9% and 53.2%). Both languages have a
clear vowel structure. For German, French and Portuguese the recognition
rates vary between 47.0% and 48.5%. Only for American English the
recognition result ends with a disappointing 37.7%. One reason for this
result could be the quality of the orthographic and phonetic transcription
of the Macrophone database. In other investigations the results for American
English are very similar to the results for French or German [18].

Language      #Rec.-Tokens   Voc. Size   Phone Acc.
French        12964          37          48.3%
German        12839          38          48.5%
Italian       10804          49          53.2%
Portuguese    21751          38          47.0%
Spanish       17512          31          56.9%
Am.-English   10815          39          37.7%
Average       -              -           48.6%

Table 3: Continuous phone recognition rate for SpeechDat(M) and Macrophone
including deletions, insertions and substitutions

Language      LDP     IPA-MAP   MUL-CLUS   IPA-OVL
French        92.2%   90.9%     90.8%      92.5%
German        96.6%   91.6%     94.8%      96.5%
Italian       94.4%   93.6%     94.0%      93.7%
Portuguese    93.0%   89.6%     91.9%      91.9%
Spanish       93.3%   92.5%     93.3%      93.1%
Am.-English   64.9%   56.5%     57.0%      63.2%
Average       89.0%   85.5%     86.9%      88.5%

Table 4: Isolated word recognition rates using the different multilingual
approaches

Altogether, the results on word and phone level show that it is possible to
create task-independent models with phonetically rich training material.
These models are compared with the multilingual approaches in the following
section.
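A phone accuracy that counts substitutions, deletions and insertions can be
computed from a minimum edit distance alignment. The sketch below uses the
common definition (N - S - D - I) / N, where N is the number of reference
phones; that the reported figures were scored in exactly this way is an
assumption, and the phone sequences are invented.

```python
def align_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum-cost
    Levenshtein alignment between two phone sequences."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # Each cell holds (total_errors, substitutions, deletions, insertions).
    table = [[None] * cols for _ in range(rows)]
    table[0][0] = (0, 0, 0, 0)
    for r in range(1, rows):
        table[r][0] = (r, 0, r, 0)
    for c in range(1, cols):
        table[0][c] = (c, 0, 0, c)
    for r in range(1, rows):
        for c in range(1, cols):
            diag = table[r - 1][c - 1]
            if reference[r - 1] == hypothesis[c - 1]:
                match = diag
            else:
                match = (diag[0] + 1, diag[1] + 1, diag[2], diag[3])
            up = table[r - 1][c]
            deletion = (up[0] + 1, up[1], up[2] + 1, up[3])
            left = table[r][c - 1]
            insertion = (left[0] + 1, left[1], left[2], left[3] + 1)
            # Tuples compare by total error count first.
            table[r][c] = min(match, deletion, insertion)
    _, subs, dels, ins = table[-1][-1]
    return subs, dels, ins


def phone_accuracy(reference, hypothesis):
    """Phone accuracy (N - S - D - I) / N for one utterance."""
    subs, dels, ins = align_counts(reference, hypothesis)
    return (len(reference) - subs - dels - ins) / len(reference)


ref = ["k", "a", "t", "s"]
hyp = ["k", "o", "t", "s", "e"]
acc = phone_accuracy(ref, hyp)
```

Because insertions are counted as errors, this measure can become negative
for very poor hypotheses, unlike a plain percentage of correct phones.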

3.3.  Results  using  the  Multilingual  Approaches 

Table 4 summarizes the isolated word recognition rates of the three
different approaches in comparison to the language-dependent modeling. For
these tests the number of densities was almost the same to achieve a fair
comparison. The method IPA-OVL outperforms the other two methods (IPA-MAP
and MUL-CLUS) and is nearly as good as the language-dependent models. The
decrease in recognition rate is only 0.5% absolute with only 13K densities
instead of 31K densities in the language-dependent case. Hence, the method
IPA-OVL is able to detect and exploit the acoustic-phonetic similarities
across the phones of different languages. The data-driven phone clustering
approach (MUL-CLUS) also performs better than the direct and simple mapping
to the IPA inventory. For these two methods, which model complete
multilingual phones, the decrease in recognition rate is 3.5% (IPA-MAP) and
2.1% (MUL-CLUS). Before we give a final conclusion the detailed results of
the three methods are discussed.

IPA-MAP 

The method IPA-MAP maps the 232 language-dependent models to 95 multilingual
models. There are 13 phones (plosives, fricatives and nasals) which occur in
all six languages. Table 5 gives an overview of how many phones are used in
how many languages. This table also shows that 48 phones are still
monolingual because they occur in only one language. Nevertheless, the
number of system parameters is drastically reduced. The number of densities
decreases from 31999 to 13555, which significantly reduces the memory and
computational requirements of the multilingual recognition system. However,
the isolated word recognition rate decreases from 89.0% to 85.5%.

# La.   #Ph.   list of phones
6       13     b d f g j k l m n p s t z
5       7      J d aru v w
4       7      M Ji 3 eio
3       3      9 JL tj
2       17     uoetRQYni aiau d3 hi: s: x 0
1       48     x q l 0 0: p e: J: y 5 ji: dy 31 X: u n a a v 1 oe a: b:
               d: d3i dz e: ei f: g: j: J k: 1: m: n: o: ou p: pf t J: t:
               ts u: u v: w y Di

Table 5: Multilingual inventory using IPA-MAP (# La. = number of languages
in which a phone occurs; #Ph. = number of such phones)

Whereas the decrease for the four Romance languages is small, the reduction
for German and American English is 5.0% and 8.4% respectively. Possible
explanations for this effect are:

• Differences in the quality and recording conditions of the Macrophone and
  SpeechDat(M) databases:
  Although a channel compensation algorithm is used, not all differences
  between the databases can be removed. This would at least explain the
  reduction for the American English system.

• Sensitivity of the models to a large vocabulary size:
  If the recognition task has a very high perplexity (in this case 685),
  very exact acoustic models are required. A small degradation of the models
  leads to a severe reduction of the recognition rate.

• Dominance of the Romance languages in comparison to the Germanic
  languages:
  Four of the six languages belong to the Romance language family. Hence,
  the multilingual models are dominated by the Romance languages. This would
  explain the decrease for the German system.

• Inconsistency of the different phone inventories:
  Whereas SAMPA is used for the Romance languages, the German lexicon is
  based on SPICOS and the American lexicon uses TIMITBET. Although all
  inventories try to realize the IPA inventory, there are some
  inconsistencies and problems during the mapping. For example, in SPICOS
  the affricates [tS], [dZ], [pf] and [ts] are divided into two single
  phones. Also in the CMU lexicon we observed some differences to the other
  inventories which could not be resolved easily. The central phone [ə] and
  the back vowel [ʌ] have the same phoneme symbol /ah/. Hence, the same
  symbol /ah/ is used to transcribe the words “bottom” /b aa t ah m/ and
  “cut” /k ah t/.

MUL-CLUS 

The data-driven method MUL-CLUS yields a higher recognition rate than the
method IPA-MAP. Especially for German the results are much better: instead
of a reduction of 5.0% we observe a decrease of only 1.8%. However, the
reduction for American English is still large (7.9%). For this experiment
the final number of multilingual phone units was set to 95 to have the same
number of phones as before. The resulting clusters differ from the IPA-based
mapping. The biggest cluster contains the fricatives [f] and [s] of all six
languages. Table 6 shows a selection of the generated phone clusters. There
are also some clusters which have the same elements as with the IPA-MAP
method. These clusters contain the nasals [m] and [n]. Phones which differ
only in phonetic length are very often mapped to the same cluster,
especially consonants. However, we also have 50 clusters with only one
element. This means that we still have a large number of monolingual phones.
Further experiments were carried out with a varying number of final
multilingual phone clusters. A noticeable decrease in recognition rate was
observed when the 232 language-dependent models were clustered to fewer than
130 multilingual phones.

Size   Cluster elements
15     fAE fSP f/T fGE j-PT fFR f.IT &AE $GE &PT gPP s:'T ssp s/T esp
12     pAE p SP p IT p FR p PT p GE ^SP ^ IT ^ PT t FR ^GE ^ IT
10     j AE yAE jSP j/T XPT j FR yGE j SP j/T j PT
7      m^AE m^SP m^IT m^FR m^PT m^GE m:^IT
7      n^AE n^SP n^IT n^GE n^FR n^PT n:^IT

Table 6: Selection of multilingual phone clusters generated with MUL-CLUS

IPA-OVL

Here the clustering was performed on the density level. The final number of
densities was set to 13K. After the clustering process there were 7720
density clusters with more than one element (multilingual clusters) and 5280
monolingual clusters. This means that 25K of the 31K language-dependent
densities are mapped to a multilingual cluster. The method IPA-OVL shows a
significant improvement for the American English system. The decrease was
now only 1.7% in comparison to the language-dependent case.

4.  SUMMARY  AND  CONCLUSION 

In this paper we demonstrated the usefulness and feasibility of the
multilingual approach. First, a telephone-based multilingual speech
recognition system was built for 6 languages. The language-dependent
phonetic models can be used for vocabulary-independent recognition tasks.
Second, we developed and compared three different methods to create
multilingual phone models. The best result was achieved with the method
IPA-OVL, which exploits the acoustic-phonetic similarities in an optimal
way. This method works on the density level rather than on the complete
phone level. Hence, it is important to consider the language-dependent
properties of the phones even if they belong to the same IPA-based phone.
The main advantages of the data-driven methods are the higher recognition
rate and the fact that the final number of parameters can be adjusted during
clustering. In all our investigations we used only context-independent
models. It would be interesting to know how these methods work with
context-dependent models. Further, more languages from other language
families should be integrated into this multilingual approach.